A Method for 21 cm Power Spectrum Estimation in the Presence of Foregrounds 



Adrian Liij^] and Max Tegmark 
Dept. of Physics and MIT Kavli Institute, Massachusetts Institute of Technology, Cambridge, MA 02139, USA 

(Dated: January 13, 2013) 

21 cm tomography promises to be a powerful tool for estimating cosmological parameters, con- 
straining the epoch of reionization, and probing the so-called dark ages. However, realizing this 
promise will require the extraction of a cosmological power spectrum from beneath overwhelmingly 
large sources of foreground contamination. In this paper, we develop a unified matrix-based frame- 
work for foreground subtraction and power spectrum estimation, which allows us to quantify the 
errors and biases that arise in the power spectrum as a result of foreground subtraction. We find 
that existing line-of-sight foreground subtraction proposals can lead to substantial mode-mixing 
as well as residual noise and foreground biases, whereas our proposed inverse variance foreground 
subtraction eliminates noise and foreground biases, gives smaller error bars, and produces less cor- 
related measurements of the power spectrum. We also numerically confirm the intuitive belief in 
the literature that 21cm foreground subtraction is best done using frequency rather than angular 
information. 

PACS numbers: 95.75.-z,98.80.-k,95.75.Pq,98.80.Es 



I. INTRODUCTION 

In recent years, 21 cm tomography has been hailed 
as a cosmological probe with the potential to overtake 
Cosmic Microwave Background (CMB) measurements as 
the most promising tool for understanding our Universe. 
Much of this is due to the sheer information content avail- 
able — as an optically thin transition of the most abun- 
dant element in the Universe, the redshifted 21 cm hyper- 
fine transition line allows one to probe a much larger vol- 
ume of our Universe than ever before. This is projected 
to lead to better constraints on cosmological parameters 
related to inflation, dark matter, dark energy, and neu- 
trinos [T]-[8]. Moreover, 21cm tomography may be our 
only direct probe of the Epoch of Reionization (EoR) as 
well as the preceding cosmological dark ages [S^'-TTSl. 

Despite its great potential, several difficulties must be 
overcome for 21 cm tomography to become a useful scien- 
tific tool. That the cosmological signal is expected to be 
on the order of 10 mK in brightness temperature means 
that sensitivity requirements for a mere detection will 
be high, especially given that foreground contamination 
of various origins (such as unresolved extragalactic point 
sources, resolved point sources, and Galactic synchrotron 
emission) are expected to contribute a total contami- 
nating effect in the range of hundreds of Kelvins [19]. 
As a result, the first-generation of arrays currently be- 
ing tested for cosmological 21 cm measurements (such as 
PAPER 120], LOFAR [IJ, MWA [22], CRT [23], GMRT 
|24j, and the Omniscope [25^) are not expected to be ca- 
pable of producing images of the small ionized regions 
that formed around the first luminous sources during the 
EoR. Instead, any cosmological results coming from these 
arrays will likely be based on large scale statistical mea- 
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sures. One example would be the so-called global step 
measurements that seek to measure the evolution of the 
average 21cm signal of the whole sky [26 . Going be- 
yond this, one may seek to detect the spatial fluctua- 
tions of the signal and to estimate the power spectrum. 
The power spectrum, in particular, has thus far been 
shown to be the most promising way to leverage 21 cm 
tomography for precision cosmology, potentially placing 
extremely tight constraints on parameters such as the 
neutrino mass, the dark energy equation of state, and 
spatial curvature JJ\ jH [27] . 

Power spectrum estimation in 21cm tomography has 
similarities to and differences from power spectrum es- 
timation for both the CMB and galaxy surveys. As it 
is for the CMB, foreground contamination is a serious 
concern, especially since the foregrounds are so strong at 
the lower frequencies (--100 - 200 MHz) that EoR ex- 
periments target. Unlike the CMB, however, 21cm to- 
mography probes a three-dimensional volume (just like 
with galaxy surveys), so the final quantity of interest is 
not an angular power spectrum but a three-dimensional 
power spectrum. Ultimately, one wishes to obtain the 
spherically averaged matter power spectrum as a func- 
tion of redshift P(A:,z), because one expects the cosmo- 
logical power to be isotropic. However, it is often prefer- 
able to first form a cylindrically averaged power spec- 
trum P{k^^k\\] z)^ since the isotropy may be destroyed 
by redshift-space distortions [28], the Alcock-Pacyznski 
effect [29H3T] , as well as residual foregrounds that arise 
from imperfect foreground removal. 

The problem of foreground removal in 21 cm tomogra- 
phy has been extensively studied in the literature. Early 
foreground cleaning proposals focused on using the an- 
gular structure of the foregrounds |32H36] , while recent 
studies have suggested that a line-of-sight (LOS) spectral 
subtraction method may be the most promising approach 
j37H44j . Though the detailed algorithmic implementa- 
tions of the LOS method vary amongst these studies. 
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the core idea is always to take advantage of the smooth 
spectral behavior of foregrounds to separate them from 
any cosmological signals, which are expected to fluctu- 
ate rapidly with frequency. Simulations of this approach 
have confirmed it as a successful foreground cleaning 
technique in the sense that the post-subtraction fore- 
ground residuals can be suppressed to a level smaller 
than the expected amplitude of the cosmological sig- 
nal. Thus, foreground contamination is unlikely to be 
an insurmountable obstacle to the initial detection of the 
21 cm cosmological signal. 

What remains unclear, however, is whether LOS sub- 
traction techniques adversely affect the quality of one's 
final estimate of the cosmological power spectrum. For 
instance, although spectrally smooth components of the 
cosmological signal are expected to be small, if any are 
present they will be inadvertently subtracted off with the 
foregrounds when LOS cleaning is performed. LOS meth- 
ods are thus not lossless, in the sense that their destruc- 
tion of cosmological information leads to a degradation 
of error bars on the final power spectra (for a more rig- 
orous definition of information loss, see [4F or Section 
[ll| of this paper for a quick review). Foreground sub- 
traction may also lead to residual systematic noise and 
foreground biases in estimates of the power spectrum, a 
fact that was explored numerically using simulations in 
[39l EH |46]. Finally, foreground subtraction along the 
LOS may lead to correlated errors in power spectrum 
measurements, which can limit their usefulness for esti- 
mating cosmological parameters. 

The goal of this paper is to adapt the existing math- 
ematical formalism for power spectrum estimation from 
the CMB and galaxy survey literature in a way that not 
only respects the unique geometrical properties of 21 cm 
tomography experiments, but also allows us to robustly 
deal with the issue of foreground contamination. This 
will permit a more complete analysis of the errors in- 
volved in estimating 21 cm power spectra, in a way that 
automatically incorporates the effect that foreground re- 
moval has on the final results. In particular, we will be 
able to quantify not just the errors on the power spec- 
trum itself (the "vertical" error bars), but also the cor- 
relations between the parts of the power spectrum (the 
"horizontal" error bars, or equivalently the power spec- 
trum window functions^). In addition, we will quantify 
the noise and foreground biases that are introduced by 
foreground subtraction and power spectrum estimation 
so that such biases can be systematically removed. We 



accomplish all this by considering the foregrounds not as 
an additional signal to be removed, but rather as a form 
of correlated noise. Doing so in our formalism allows us 
to build on the numerical simulation work of [39^, [4T, 
and [46 to analytically show how LOS foreground sub- 
traction methods are lossy and leave residual noise and 
foreground biases. The formalism also naturally suggests 
an alternative method for foreground subtraction that 
leads to smaller error bars and eliminates the residual 
biases. 

The rest of this paper is organized as follows. In Sec- 
tion [TTj we introduce the mathematical notation and for- 
malism that we use for power spectrum estimation. Ex- 
isting LOS methods are cast in this language, and in addi- 
tion we introduce our alternative foreground subtraction 
scheme. In Section [IV| we test our methods on a specific 
foreground model, which we describe in Section [LQI Pro- 
vided that the foregrounds satisfy certain generic criteria 
(such as having a smooth frequency dependence), the 
details of the foreground model should have no effect on 
our qualitative conclusions. In Section [V| we show how 
these qualitative conclusions can be understood through 
a simple toy model. We summarize our conclusions and 



discuss broader implications in Section VI 



II. THE MATHEMATICAL FRAMEWORK 

In this section we review the power spectrum esti- 
mation formalism introduced by [45^, and adapt it for 
21 cm tomography. Readers interested in the mathemat- 
ical details of power spectrum estimation are encouraged 
to peruse [4^ 49-51, . Our development builds on the 
quadratic methods presented in [45l [50] . The goal is to 
derive new expressions appropriate to 21 cm power spec- 
trum esti mation (such as Equation [s] and the expressions 
in Section II C| ) as well as to systematically place differ- 
ent power spectrum estimation methods in a common 
matrix-based framework. 



A. Quadratic Estimators 

Our objective is to use the measured 21 cm brightness 
temperature distribution Tij{t) to estimate a cylindrically 
symmetric power spectrum^ PT{k±^ k\\) which is conven- 
tionally defined by the equation 

(fb(k)*ff,(k')) = {27rfS{k-k')PT{k^,k«), (1) 



These are not to be confused with the instrumental window func- 
tions that are conventionally defined in radio astronomy to quan- 
tify the effect of an instrument's beam and noise properties (and 
as used, for instance, in related 21 cm power spectrum papers 
such as [47] and [48]). In this paper, the "window function" can 
be thought of as a convolution kernel that produces the mea- 
sured power spectrum when applied to the true power spectrum. 
Please see Section [TTl for a rigorous mathematical definition. 



^ Note that in this paper, we are not examining the steps re- 
quired to estimate the matter power spectrum. Instead, we 
are focussing on estimating the temperature power spectrum 
PT(k±, A^ii), which is an important first step for finding the mat- 
ter power spectrum. For details on how one might extract a 
matter power spectrum from a temperature power spectrum, see 
for example [41128]. 



3 



where T5 is the three-dimensional spatial Fourier trans- 
form of T5, and S is the Dirac delta function. As written, 
this definition contains the continuous functions T5 and 
Pt. However, any practical numerical scheme for esti- 
mating the power spectrum must necessarily be discrete. 
The simplest way to do this is to divide the measured 
spatial distribution of brightness temperature into dis- 
crete voxels, so that T5 takes the form of a data vector 
X that is essentially a list of the brightness temperatures 
measured at various points on a three-dimensional grid. 
Note that in forming this vector, one should not use the 
full radial extent of the data that are available in a typ- 
ical 21cm tomography experiment. Instead, one should 
split the full data into multiple x vectors, each of which 
spans only a very narrow range in redshift so that the 
evolution of the cosmological signal is negligible. The 
analysis that we describe in this paper should then be 
performed separately on each vector to produce a number 
of power spectra at different redshifts. Failure to do so 
would violate the assumption of translational invariance 
that was implicit when we defined the power spectrum 
using Equation [1] 

To discretize the power spectrum, we parametrize it as 
a piecewise constant function^, such that PT{k±^k\\) = 

Pab for <k^ < and kl < k\\ < If the index 

a runs over M different values and the index b runs over 
N values, the power spectrum can then be stored in an 
MA/'-dimensional vector p^^ where index pairs (a, b) have 
been folded into a single index a. Parameterizing the 
power spectrum in this way (where each component Pa 
is referred to as the band power of the band a) represents 
no significant loss of information as long as the widths of 
the k bins are small compared to the physical scales over 
which Pt varies appreciably [50] , 

A quadratic method for estimating our discretized 
power spectrum is one where the estimator of Pa takes 
the quadratic form 



Pa 



(x-m)^E"(x-m) -60 



(2) 



for some family of symmetric matrices and constants 
ba (one for each /c-region in our piecewise constant dis- 
cretized power spectrum). The vector m is defined as 
the ensemble average over different random realizations 
of the random variable data vector x, i.e. m = (x). The 
hat (^) denotes the fact that what we have here is an es- 
timate of the true power spectrum p^ from the data. The 
matrix encodes the Fourier transforms, binning, and 
— crucially — the weighting and foreground subtraction 



^ In a practical application of the methods described in this pa- 
per, it is often preferable to instead parametrize as a piecewise 
constant function the ratio of the power spectrum to a prior. As 
shown in [52J, doing so tends to give better behaved window func- 
tions (Equation [sj. Here for simplicity we employ a white {i.e. 
constant) prior simply because there are as yet no observational 
constraints on the form of the 21 cm power spectrum. 



steps required in going from the calibrated data vector x 
to the corresponding estimate of the power spectrum Pa . 
For example, if one chooses to forgo any form of fore- 
ground subtraction and to form a power spectrum using 
a completely uniform weighting of all voxels, the matrix 
takes the form 



^ij 



—{^,a)ij 
no fg 



I 



ik.(r,-r,) 



(27r 



' ijj ij 



sin k 



b' ij 



(2^)3 



sinA:;^'_ir;^ ) x 



i^a^ij'^li^a^ij) ^a-l^ij'^li^a-l^ij)] ' 



(3) 



where r • = r 

''J 



• is the radial line-of-sight distance be- 
tween spatial grid points and r^, r:^ = |r^| = — 

is the projected perpendicular distance, cpk = arccos(k^ • 
f^) is the angle between and r^-, Vj^ is the volume 
in Fourier space of the a-th bin, and Ji is the 1^^ Bessel 
function of the first kind. (Recall that indices a and b 
specify the k± and k\\ bins, respectively, and are folded 
into the single index a). The notation C^a is intended 
to be suggestive of the connection between the correla- 
tion function (or covariance matrix) C of the data and 
its discretized power spectrum Po,: 



C = ((x-m)(x-m)*)=C/,+N- 



E 



PaC^a^ (4) 



where in the last equality we were able to take advan- 
tage of the fact that the foregrounds, instrumental noise, 
and signal are uncorrelated to write the total covariance 
as the sum of individual covariance contributions {Cfg 
for the foregrounds and N for the instrumental noise) 
and the contribution from the cosmological signal (the 
final term) with no cross-terms. In this form, we see that 
C^a = dC/dpa is simply the derivative of the covariance 
with respect to the band power. Intuitively, the last term 
in Equation [4] is simply an expansion of the correlation 
function of the cosmological function in binned Fourier 
modes. As a more familiar example, consider a situation 
where one is trying to estimate the three-dimensional 
power spectrum PtO^) instead of Pt(^±,^||)- In such 
a case, there would be no spherical or cylindrical bin- 
ning in forming the power spectrum, and we would have 
{^,a)ij ^ g*k« (rj-ri) Equation [i] then simply reduces 
to the well-known fact that the power spectrum is the 
Fourier representation of the correlation function. 

Different choices for the matrix will give power 
spectrum estimates with different statistical properties. 
A particularly desirable property to have is for our power 
spectrum estimate to be free from noise/foreground bias, 
so that the final estimator Pa depends only on the cos- 
mological power and not on noise or foregrounds. Tak- 
ing the expectation value of Equation |2] and substituting 
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Equation [4] yields 

where Wa(3 is a matrix that we wih discuss below. From 
this expression, we see that to eliminate the noise and 
foreground biases, one should pick 



6„ = tr[(C/<,+N)E"], 



(6) 



The presence of Gfg in this expression means that in 
our formalism, we can consider foreground subtraction 
to be a two-step process. The first step acts on the 
data X directly through in forming the quantity 
(x-m)'E"(x-m). As we will see explicitly in Sections 
|IIB and |II C[ this step involves not just the Fourier trans- 
forming and binning as outlined above, but also a linear 
foreground subtraction acting on the data. The result is 
a biased first guess at the power spectrum. The second 
step (where one subtracts off the term) is a statistical 
removal of foregrounds from this first guess, which acts 
on a quadratic function of the data (since it is applied 
to the power spectrum estimate and not the data) and 
is similar in spirit to the methods suggested in [53j. Our 
formalism builds on the work in ^53| and allows one to 
compute the appropriate statistical foreground removal 
term for any quadratic power spectrum estimate — 
one simply plugs the corresponding into Equation [6j 
With ba chosen appropriately. Equation [5] reduces to 
a relation between our power spectrum estimate and the 
true power spectrum: 



P = Wp, 



(7) 



where we have grouped the components of the true power 
spectrum pa and our power spectrum estimate Pa into 
the vectors p and p respectively^, and W is the window 
function matrix^ given by 



W„;3 =tr[C,;3E"]. 



(8) 



In general, W will not be a diagonal matrix, and thus 
each component of our power spectrum estimate vector 
p {i.e. each band power) will be a weighted sum of differ- 
ent components of the true power spectrum. Put another 
way, the power spectrum estimate at one particular point 
in the k±-k\\ plane is not merely a reflection of the true 
power spectrum at that point, but also contains contri- 
butions from the true power spectrum at nearby loca- 
tions on the kj_-k\\ plane. Neighboring points of a power 
spectrum estimate are thus related to one another, and 



give rise to "horizontal error bars" in one's final power 
spectrum estimate. Equation |8] allows one to quantify 
the extent to which foreground subtraction causes un- 
wanted mode- mixing between different parts of /c-space, 
again because the E" matrix can be written to include 
any linear foreground subtraction process. As a general 
rule of thumb, the broader one's window functions the 
greater the information loss in going from the true power 
spectrum to our power spectrum estimate. The window 
functions are thus a useful diagnostic for evaluating one's 
estimation method, and we devote Section |IV A| to ex- 
amining the window functions from various methods for 
estimating the power spectrum. 

In addition to computing the window functions, a com- 
plete evaluation of one's power spectrum estimation tech- 
nique should also involve a computation of covariance 
matrix V of the band powers: 



(9) 



Roughly speaking, the diagonal elements of V give the 
"vertical error bars" on our power spectrum estimate. If 
the signal is Gaussian, Equation [9] can be written as 

= ^[CikCji + CiiCjk]'Efj'E^i. (10) 

ijkl 

Ultimately, one of course wishes to minimize the vari- 
ances {i.e. error bars) on our power spectrum estimate. 
For a deconvolved power spectrum estimate^, i.e., one 
where W = I so that 



(P) = P. 



(11) 



the smallest possible error bars that can be obtained for 
a given experimental set-up can be computed using the 
Fisher matrix formalism. The Fisher information matrix 
is defined as [SH 



dpadp(3 



In/ 



(12) 



where / is the probability distribution for the data vector 
X, and is dependent on both x and the band powers Pa, 
which we defined above. Doing so and assuming that the 
fluctuations are Gaussian allows one to write the Fisher 
matrix as 



Fa/3 = -tr [C^Q,C ^C^f^C 



(13) 



This grouping is not to be confused with our earher grouping of 
data into the vector x. The index on x runs over the spatial grid 
points of one's measured brightness temperature distribution, 
whereas the index of vectors hke p runs over different bands of 
k± and . 



^ A power spectrum estimator that respects Equation |ll| is some- 
times also said to be an unbiased estimator. This is conceptually 
separate from our earlier discussion about choosing appro- 
priately to eliminate noise and foreground bias. Unfortunately, 
both usages of the word "bias" are standard. In this paper we 
reserve the term for the latter meaning, and instead use "unwin- 
dowed" or "deconvolved" when referring to the former. 
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By the Cramer- Rao inequality, the quantity (F~^)l/a 
represents the smahest possible error bar on the band 
power Pa that any method satisfying Equation 11 can 



where F^a are diagonal elements of the Fisher matrix 



achieve if one is estimating all the band powers jointly, 
while (Fc^Q,)"^/^ gives the best possible error if the other 
band powers are already known. In our particular case, 
this means that if a given foreground subtraction and 
power spectrum estimation technique (specified by the 
set of E"s) yields a covariance matrix V that is equal to 
F~^, the technique is optimal in the sense that no other 
unwindowed method will be able to produce a power 
spectrum estimate with smaller error bars. This method 
will turn out to be closely related to the inverse variance 
scheme that we introduce in Section |IIB[ where we ap- 
ply the formalism that we have developed so far to 21 cm 
tomography. Similarly, in Section |II C| we will take the 
traditional line-of-sight subtraction algorithms suggested 
by [371 ESI SDl [43^ and recast them in our mathematical 
framework. We will see in Section [TVl that the traditional 
methods result in a substantial loss of information, result- 
ing in error bars that are larger than one obtains with the 
inverse variance method. 



B. Inverse Variance Foreground Subtraction and 
Power Spectrum Estimation 

Suppose we form the quantity 

qo. = \(?^- m)^C-^C,«C-^(x - m) - 6,, (14) 

and take p = F~^q to be our power spectrum estimate. 
In [45 it was shown that this estimate is precisely the 
unwindowed (i.e., W = I), optimal estimator hinted at 
in the last section. This estimator gives error bars that 
are exactly those specified by the Cramer-Rao bound. 

In practice, however, using this estimator tends to give 
power spectra that are quite noisy. This is because win- 
dow functions naturally have a width of order the inverse 
size of the survey volume, so insisting that W = I en- 
forces a rather ill-posed deconvolution that greatly am- 
plifies the noise. This results in band power estimates 
that have error bars that are both large (despite being 
the "best" allowed by Cramer-Rao) and anticorrelated 
between neighboring bands (see [51] for a more extensive 
discussion). To avoid these issues, it is often preferable 
to smooth the power spectrum on the k^-k\\ plane, which 
mathematically means having a nondiagonal W. This in 
turn implies that we have p = Wp, which allows us to 
evade the Cramer-Rao bound since the p = p require- 
ment is no longer enforced. In other words, by allowing 
some smoothing of our power spectrum estimate, we can 
construct an estimator with smaller error bars than those 
given by the Cramer-Rao inequality. 

In this paper, we will construct such an estimator by 
choosing 



1 



2Fq,q, 



(15) 



given by Equation |T3] Just like with Equation [T4j the 
choice of Equation ^] for represents an inverse vari- 
ance weighting of the data. This can be seen by inserting 
Equation [15] into Equation [2] 



1 



Pa 



2Fn 



-(x - m)'C-^C,c.C-^(x - m) - he,. (16) 



Since the covariance matrix C is symmetric, using Equa- 
tion [15 corresponds to using an inverse variance weighted 



data vector (C~^x) to perform our power spectrum esti- 
mate. This weighting procedure acts as our foreground 
subtraction step, since in Equation [4] we included the 
foregrounds in our covariance matrix. The residual noise 
and foreground bias is subtracted by the he, term, which 
is obtained by substituting Equation [15] into Equation 
[g] The (nondiagonal) window functions are obtained by 
inserting Equation [15] into Equation [Sj and it is by im- 
posing the normalization condition Wq,q, = 1 (as is stan- 
dard) that the {Faa)~^ normalization factor appears. 
Throughout this paper. Equations [Ts] [TB} and the corre- 
sponding window functions are what we refer to as "the 
inverse variance method" . 

In [45", it was shown that in the limit that the data 
vector X is drawn from a Gaussian distribution^, the in- 
verse variance method beats all other power spectrum 
estimators, in the sense that no other method — win- 
dowed or not — can deliver smaller error bars on the 
final power spectrum estimates. To compute these error 
bars, we insert Equation 15 into Equation 10 and (as- 



suming Gaussianity in a manner similar to [45]), obtain 



(17) 



In particular, the "vertical error bars" on a particular 
band power are given by the diagonal elements of V, 
giving 



1 



(18) 



^ Gaussianity is certainly not a good assumption for 21 cm tomog- 
raphy. However, even if not strictly optimal, an inverse variance 
weighting of data is often desirable. Moreover, in the presence 
of strongly non-Gaussian signals, the power spectrum itself is a 
non-optimal statistic in that it fails to capture all the cosmo- 
logical information present in the raw data. Thus, the mere act 
of replacing the data by a power spectrum is in this sense a 
Gaussian approximation, and in that limit the inverse variance 
method is the optimal one. Note also that while the formulae for 
the error bar estimates (such as Equation |18| may suffer from 
inaccuracies if there are non-Gaussianities, the expressions for 
the window functions (Equation [?]) and the noise and foreground 
bias (Equation [6| remain strictly correct. This is because the 
window functions and the bias terms are derived from manipu- 
lating Equation [2] which only involves second moments of the 
data vector x (and are therefore completely describable in terms 
of covariances) , whereas the error bars come from Equation [9] 
which implicitly depends on fourth moments of the data. 
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This means that the inverse variance method dehvers er- 
ror bars that are smaher than those given by the Cramer- 
Rao bound, because here our error bars are equal to 
(Fao;)"^^^, something which can only be achieved by an 
unwindowed estimator if all but one of the band powers 
are known beforehand. The windowing has thus achieved 
our goal of producing a less noisy power spectrum with 
smaller error bars. Note that this outcome was by no 
means guaranteed — for instance, in Section [IV| we will 
find that the traditional line-of-sight methods described 
in Section III CI essentially smooth the k_\_-k\\ plane too 



much, resulting in window functions that are so broad 
that there is a substantial loss of information, which in 
turn causes larger error bars than one obtains with the 
inverse variance method. 

Comparing the inverse variance method to the unwin- 
dowed estimator p = F~^q discussed earlier, we see that 
the difference lies in whether one normalizes the power 
spectrum using F"-*- or {Faa)~^- While we have just seen 
that the latter gives smaller error bars on the power spec- 
trum, the choice of normalization becomes irrelevant as 
one goes beyond the power spectrum to constrain cosmo- 
logical parameters. This is because both choices consist 
of multiplying by an invertible matrix, and thus no infor- 
mation is lost. This is not true for the traditional line- 
of-sight algorithms, where the non-optimal error bars on 
the power spectrum are due to an irreversible loss of in- 
formation, which will in turn cause larger error bars on 
cosmological parameters. 



C. Line-of- Sight Foreground Subtraction 

In a typical^ line-of-sight (LOS) foreground subtrac- 
tion method, the measured signal from each LOS is plot- 
ted as a function of frequency (or, if one prefers, of ra- 
dial distance) and a low-order polynomial is fit to the 
data. The low-order fit is then subtracted from the data, 
with the hope that the remaining signal varies sufficiently 
rapidly with frequency to be dominated by the cosmo- 
logical contribution to the signal and not by foregrounds 
(which are spectrally smooth and therefore expected to 
be well-approximated by low-order polynomials). Math- 
ematically, if one arranges the elements of the data vector 
X so that one cycles through the radial/frequency direc- 
tion most rapidly and the perpendicular/angular direc- 
tions less rapidly, the action of a LOS foreground subtrac- 
tion can be described by the equation z = Dx, where z 
is the foreground-cleaned data and D is a block diagonal 
matrix. That D is block diagonal is simply an expression 
of the fact that different lines-of-sight are independently 
cleaned in this algorithm, i.e. there is no attempt to use 



^ As noted above, variants of the LOS method exist, but we expect 
the conclusions of this paper to be independent of our specific 
implementation. 



the angular structure of foregrounds for cleaning. If one's 
data cube has n\\ pixels along the line of sight direction 
and a total of n_\_ pixels in the perpendicular directions, 
then D consists of n± blocks, each of size ny x ny and of 
the form 



D 



single block 



I-X[X*X]-iX* 



(19) 



where I is identity matrix, X is an ny x (m + 1) matrix 
such that X^j equals the ith frequency (or radial pixel 
number) taken to the {j — l)th power, and m is the order 
of the polynomial fit. In [39l |40] it was found that a 
quadratic polynomial ought to be sufficient for cleaning 
foregrounds below the level of the expected cosmological 
signal. In general, one should select a polynomial that 
is as low an order as possible to mitigate the possibility 
of accidentally removing part of the cosmological signal 
during foreground subtraction. 

Once foreground subtraction has been performed, we 
can form the power spectrum by Fourier transforming, 
binning, and squaring. The Fourier transform and bin- 
ning are accomplished by the matrix C^c^, which is given 
by Equation [sj Subsequently multiplying by z^ squares 
the results, so the final estimate of the power spectrum 
takes the form 



Pa 



^LOS 



z^C,c,z = (x - m)^DC,c,D(x - m). (20) 



Comparing this to Equation |2j it is clear that the LOS 
foreground subtraction methods (and subsequent power 
spectrum estimations) proposed in the literature are 
quadratic methods with = DC^^D and bo, = 0. The 
window functions can be readily computed by substitut- 
ing = DC^c^D into Equation |8j giving 



tr[C,„DC,^D]. 



(21) 



As we shall see in Section [IV| the off-diagonal elements of 
this matrix are large, which implies that LOS foreground 
subtraction has the effect of introducing large correla- 
tions between the power spectrum estimates at different 
parts of the k±-k\\ plane. 

Comparing the LOS algorithm to the optimal inverse 



variance algorithm described in Section II B[ we can iden- 
tify three areas in which the LOS algorithm is non- 
optimal: 

1. The method produces power spectra that 
still contain a residual noise and foreground 
bias. This conclusion has been numerically con- 
firmed by [39 , [44 , and [46 , and trivially falls out 
of our analytic framework — Equation [6] shows that 
to eliminate the residual bias, one must subtract 



lLOS 



tr[(C^, + N)ET 
tr[(C/, + N)DC,«D] 



(22) 



from the initial estimate, so the setting of b^^*^ 
to zero means the method is biased. Fortunately, 
since D and C^a are known a prio ri and Cfg and 
N can be modeled (see Section III for details), this 
bias can be easily removed. 
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2. The method does not make full use of the 
available data. The foreground cleaning in the 
LOS method can be thought of as a projection of 
the data vector x into the subspace orthogonal to 
low order polynomials in the frequency direction^. 
To see that, note that the matrix D defined above 
takes the form of a symmetric (D = D^) projection 
matrix (D^ = D), so using the vector z = Dx to 
estimate our power spectrum essentially amounts 
to limiting our analysis to the subset of data or- 
thogonal to the polynomial modes. Such a proce- 
dure, where one projects out the modes that are 
a priori deemed contaminated, is exactly analo- 
gous to similar techniques in CMB data analysis 
(where one customarily removes the monopole and 
dipole modes, as well as pixels close to the Galac- 
tic plane) and galaxy survey analysis (where one 
might project out some purely angular modes in 
order to protect against incorrectly modeled dust 
extinction). 

The projecting out of contaminated modes will nec- 
essarily result in larger error bars in one's final 
power spectrum estimate, because cosmological in- 
formation is irreversibly destroyed in the projection 
procedure. Indeed, this has been a concern with 
LOS subtraction, for any component of the cosmo- 
logical signal that is non-orthogonal to low-order 
polynomials in frequency will be inadvertently sub- 
tracted from the data. At best, if the cosmological 
signal happens to be completely orthogonal to the 
polynomial modes, estimating the power spectrum 
from the projected data will give error bars that 
are identical to an optimal method that uses the 
full data, since the optimal method can simply as- 
sign zero weight to the contaminated modes. 

3. The method is a non-optimal estimator even 
for the projected data. The prescription given 
by Equation |20] calls for a uniform weighting of the 
projected data z in the estimation of the power 
spectrum. In it was shown that ideally one 
should instead apply inverse variance weighting to 
the remaining data. Since the covariance matrix of 
the reduced data C = DCD is singular^, the in- 
verse variance weighting is accomplished using the 
so-called pseudoinverse, given by 



M = D[C + 7?XX^]-^D, 



(23) 



where 77 is a non-zero constant and the matrix X 



is the same as that defined for Equation 19 This 



^ A complementary treatment of LOS foreground subtraction that 
also casts the subtraction as a projection of data onto a subspace 
of polynomials can be found in pi- 

^ An unsurprising result, given that we have thrown away select 
modes in x. 



pseudoinverse can be shown to be independent of 
T] [45 , and has the property that CM = I in the 
subspace of remaining modes, as one would desire 
for an inverse. To estimate a power spectrum with 
an inverse variance weighting of the projected data, 
one simply acts with the pseudoinverse after the 
projection: 



MDx = Mx, 



(24) 



where in the final step we made use of the fact that 
so MD = M from Equation [23 



D2 = D, 



With this weighting, the resulting power spectrum 
estimate becomes optimal for the projected data, 
in that the covariance matrix \'ai3 becomes equal 
to the expression given in Equation [T7| except that 
one uses not the full Fisher matrix F but the Fisher 
matrix of the projected data F. This Fisher matrix 
can be proven [45] to take the same form as Equa- 
tion [Tsj except with the pseudoinverse taking the 
place of the inverse covariance: 



F^^ = ^tr[C,„MC,;3M]. 



(25) 



In Sectio n |IV| we will use Equation 25 (inserted into 
Equation |18|) to estimate the errors for the LOS method, 
even though the LOS method as proposed in the liter- 
ature does not optimally weight the data. This is be- 
cause once M has been computed, the optimal weighting 
can be accomplished relatively easily, and moreover most 
forecasts of the ability of 21 cm tomography to constrain 
cosmological parameters assume that the weighting is op- 
timal (see for instance [4]). The LOS method that we use 
to generate the results in Section [TV| should therefore be 
considered an improved version of the traditional meth- 
ods found in the literature. Even so, we will find that 
the inverse variance method does better. 



III. FOREGROUND AND NOISE MODELING 

Unlike the original LOS subtraction discussed in the 
first part of Section [TTCl the optimally weighted version 
of the LOS subtraction scheme and the pure inverse vari- 
ance scheme of Section |IIB| perform foreground subtrac- 
tion in ways that are not blind. In other words, to in- 
verse variance weight the data it is necessary to have a 
model for the covariance matrix C, which in turn means 
that one must have a foreground model, since the fore- 
grounds appear in Equation [4j We note, however, that 
since Equation [t] depends only on geometry (via C^/3) 
and one's chosen power spectrum estimation method (via 
E"), the window function matrix W^is remains strictly 
correct even if the foreground model is not. In other 
words, even though the matrix for our inverse vari- 
ance scheme involves factors of C~^, this does not detract 
from the accuracy of our window functions, since an in- 
correct C will simply result in a different E", which will 
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FIG. 1: Schematic visualization of various emission components that are measured in a 21cm tomography experiment. From 
left to right: The geometry of a data "cube", with the line-of-sight direction/frequency direction as the z-axis; resolved point 
sources (assumed to be removed in a prior foreground step and therefore not included in the model presented in Section III), 
which are limited to a few spatial pixels but have smooth spectra; Galactic synchrotron radiation and unresolved point sources, 
which again have smooth spectra, but contribute to every pixel of the sky; and detector noise, which is uncorrelated in all three 
directions. 



in turn give different window functions that nonetheless 
accurately reflect what was done to the data. It should 
also be noted that even with schemes where the fore- 
ground subtraction is completely blind, a proper estima- 
tion of the error bars will necessarily involve a foreground 
model. We also re-emphasize that the methods presented 
in Section |ll| are generally applicable to any foreground 
model, and that the foreground model described in this 
section is used only for the numerical case study pre- 
sented in Section HVl 

The foreground model that we use in this paper con- 
tains the following features: 

1. Following [an Emni [43], we assume that bright, 
resolved point sources above some flux Smax have 
been identified and removed from the data. This 
may be done using traditional radio astronomy al- 
gorithms such as CLEAN [55] [56] or more recently 
developed techniques suitable for low-frequency ra- 
dio interferometers (e.g. [57t ISS])."*-^ 

2. Based on extrapolations from CMB data [60], we 
assume that free-free emission is negligible com- 
pared to Galactic synchrotron radiation at the rele- 
vant frequencies. Since both free- free emission and 
Galactic synchrotron radiation have spectra that 
are well-described by the same functional form [37 , 
we can safely ignore free-free emission for the pur- 
poses of this paper, because its contribution would 
be subdominant to the uncertainties inherent in the 
Galactic synchrotron parameters. 



For an analysis of how such bright point sources can affect the 
errors on one's final power spectrum, see 59 . Their analysis 
is complementary to what is done in this paper, in that they 
consider the effects of bright point source removal, whereas we 
consider the effects of spatially extended foregrounds. 



3. With bright point sources removed and free- free 
emission safely ignorable, the foreground sources 
that remain are unresolved point sources and 
Galactic synchrotron radiation. 

Since the two foreground sources in our model are in- 
dependent in origin, we may assume that their contri- 
butions to the signal are uncorrelated, thus allowing us 
to write the foreground covariance Gfg as the sum of 
an unresolved point source covariance Gpt and a Galac- 
tic synchrotron covariance Osync Figure [T] provides a 
visualization of these various components, and in the fol- 
lowing subsections we consider each term in turn. 

A. Unresolved Point Sources 

Consider first the unresolved point source covariance. 
Since the unresolved point sources can still be consid- 
ered "local" in radial distance compared to the cosmo- 
logical signal, the frequency dependence of these fore- 
grounds is due entirely to the chromaticity of the sources. 
This means that correlations in the LOS/frequency "di- 
rection" are independent of the spatial distribution of the 
sources, and the covariance matrix can be modeled as a 
separable function in frequency and angular position: 

Cp^(r,r') ^ ((x(r)-m(r))(x(r')-m(r'))*> 

= T{v,v')e{v^,v'^). (26) 

The function 9(rx,r'x) encodes the spatial distribution 
of point sources, and is therefore where the clustering of 
point sources manifests itself. While point sources are 
known to be clustered, the clustering is weak and for 
many applications it is permissible to ignore the cluster- 
ing [61 . If clustering is ignored, the number of sources in 
a given pixel on the sky can be modeled using a Poisson 
distribution, and the resulting angular power spectrum 
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of point sources is fiat {i.e. independent of angular scale) 
[62] , This gives rise to a perpendicular correlation func- 
tion/covariance proportional to ^(r^ — r'^). 

It should be emphasized, however, that it is certainly 
not necessary to assume that point sources are unclus- 
tered. Whether or not clustering is included, the fore- 
ground cleaning and power spectrum estimation proce- 
dure described in Section IIIBI remains the same. One 
simply inserts a different covariance matrix C into the 
quadratic estimator. In fact, the inclusion of clustering 
will aid foreground subtraction if the clustering pattern 
differs significantly from that of the cosmological signal. 

Unfortunately, only future experiments will be able to 
probe the angular clustering of point sources on the fine 
arcminute scales probed by 21 cm tomography. For sim- 
plicity, we therefore model the O function as a Gaussian 
in the perpendicular separation r± = \t_\_ — r'^|. With- 
out loss of generality, we can normalize so that B = 1 
when = r'^ , because the overall amplitude of the fore- 
grounds can be absorbed into r(i^, u^). This means that 



e(r^,rl) 



(27) 



where the ax-parameter is chosen to take the small value 
~ 7 arcminutes to reflect the weakness of the clustering. 

In contrast to the perpendicular directions, one expects 
correlations in the LOS direction to be strong. In other 
words, r{h'^u') should be large even for v ^ v' because 
of the high spectral coherence of foregrounds. To com- 
pute this correlation, consider first a simplified model 
where we have a population of point sources with an av- 
erage number density n per steradian, giving an average 
of nQpix sources per sky pixel of size ftpix steradians. 
We imagine that all the point sources in this population 
have the same fiux at frequency = 150 MHz, and 
a power law frequency spectrum ex with a spectral 
index a drawn from a Gaussian distribution 



p{a) 



exp 



(g - gp)^ 
2^2 



(28) 



with cTq, = 0.5 [60] and to be fixed later. With this 
foreground model, the average signal along a particular 
LOS is given by 



m{u) = {x{v)) = yo^j i'riQpixS^) J l^ — j p{a)da, 

(29) 

where A^^/Qpix is a conversion factor that converts our 
expression from fiux units to temperature units. The 
u subscript serves to remind us that it is a function of 
frequency, and numerically we have 



'AS 



1.4 X 10" 



^pix 

1 sr 



mJy-^ K. 

(30) 



Similarly, the covariance (or correlation function) is 



2 '^^^pix^* / 
nix J 



p{a)da. (31) 



Note that the covariance is proportional to n and not v? 
because in a Poisson distribution^^ the mean is equal to 
the variance. Evaluating the integral gives 



nv,v') 



^71 



-nSi 



-Q!0+- 



- In 



(32) 



which shows that the superposition of power law spectra 
with Gaussian-distributed spectral indices gives exactly 
a power law with a positive running of the spectral index. 

Now, in reality one of course has multiple populations 
of point sources with varying brightness S^. If we treat 
these populations as independent (in a similar fashion to 
what was done in [62]), then the total covariance due to 
all populations is given by the sum of the covariances 
of the individual populations. In the limit of an infinite 
number of populations, one has 



pix Jo 



"""" Q2,n 



-Q!o + ^ In 



(33) 

where dn/dS is the differential source count and Smax 
is a maximum point source fiux above which we assume 
the sources can be resolved and removed from the data 
prior to applying our foreground subtraction and power 
spectrum estimation scheme. Simulations of so-called 
"peeling" techniques have suggested that resolved point 
sources can be reliably removed down to 

Smax ^ 10 to 

100 mJy In this paper, we go a little above the 

middle of this range and take Smax ^ 60mJy. For the 
differential source count we use empirical fit of [32 , which 
takes the form 

^ = (4.0.Jr's.-)(55^)"". (34) 

and putting everything together, we obtain 



r(i/,j/') = (149 K^) 



^ni 



10-6 sry/ 



'pzx 



(35) 

where aps = ao + 2 = 2.5, with this numerical value 
chosen^^ to match extrapolations from CMB data [60|. 



This is not to be confused with the spatial distribution of the 
sources. What is being modeled as Poisson distributed is the 
number of sources in a given pixel. 
-•^^ For the results presented in Section |lV| we in fact used larger 
values for both aps and cxsync- We estimate the difference to 
be no larger than 1% at any frequency. Moreover, with larger 
spectral indices the foregrounds become steeper functions of fre- 
quency, and are therefore harder to subtract out, making our 
results more conservative. 
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Combining this with Equation [27] gives us our expres- 
sion for the unresolved point source foreground covari- 
ance matrix. 



B. Galactic Synchrotron Radiation 

With the synchrotron contribution, it is more difficult 
to write down the form of the covariance matrix, because 
it is unclear how to rigorously define the ensemble aver- 
ages required in the computation of the covariance. In 
the frequency direction, we simply use the same expres- 
sion as we obtained for the unresolved point sources. We 
expect this to be a reasonable model for two reasons. 
Physically, the emission from unresolved point sources 
is also synchrotron radiation, albeit from distant galax- 
ies. Empirically, the spectrum of Galactic synchrotron 
foregrounds is well-described by the same parametric fits 
that work well for the unresolved point sources, except 
with slightly different parameters |371 160] . We thus reuse 
Equation 35 with apg OLsync = 2.8, <Jq, = 0.4, and 
the amplitude of T {including the Vtpix factor) to be 
1.1 X 10^ ah of which are values more suitable for 



Galactic synchrotron radiation As for the perpen- 
dicular part of the covariance matrix, we once again run 
into the same problem we faced with the unresolved point 
sources, namely that there is a lack of empirical spatial 
correlation data on the fine scales of interest. For simplic- 
ity, we use Equation [27| for the Galactic synchrotron fore- 
ground contribution as well, but with ~ 10 degrees to 
reflect the fact that the Galactic synchrotron contribu- 
tion is expected to be much more spatially correlated 
than the point sources are [60^ . 



C. Instrumental Noise 

To model the instrumental noise contribution to the 
covariance matrix, we use the standard formula for the 
uncertainty in a measurement of the spatial wavevector 
kx component of the sky brightness: 



AT(k^,rll) 



(36) 



where A is the observing wavelength, Av is the channel 
bandwidth of the instrument, is the effective area of 
an antenna, Tgys is the system temperature of the in- 
strument, and Tk is the total integration time that the 
instrument spends observing Fourier mode (see [63] 
for a pedagogical discussion). The dependence on the 
radial distance r" enters via A, whereas the dependence 
on k^ enters via r^. The amount of time that each 
Fourier mode is observed for is greatly affected by the 
layout of one's interferometer array, since each baseline 
in the array probes a particular Fourier mode k^ at any 
one instant. Once AT has been determined, one must 
take Fourier transforms in the two perpendicular direc- 
tions to obtain AT in position space. One can then form 



the quantity N = (AT(r)AT(rO), with the additional 
standard assumption that the noise is uncorrelated be- 
tween different frequencies. 

In this paper, we pick fiducial system parameters that 
are similar to those of the MWA. Specifically, the results 
in the following sections are computed for an interferom- 
eter array with 500 antenna tiles randomly distributed 



with a density varying as 



where r is the distance 



from the center of the array. The length of the maximum 
baseline is taken to be 1500 m, and we assume 1000 hrs 
of observing time. Rotation synthesis is performed with 
the telescope located at the South Pole for simplicity. 
The channel width of a single frequency channel is set at 
Av = 30 kHz, the system temperature at Tgys = 440 K, 
and we take the effective area of each antenna tile to 
be Ae = 15 m^. These quantities can be taken to be 
roughly constant over the narrow frequency range of 150 
to 150.9 MHz (corresponding to 30 adjacent frequency 
channels) that we take to define the radial boundaries of 
the survey volume we use for the numerical case study 



of Section IV In the perpendicular directions we take 
our field of view to be 1.2 deg^, which gives 16 pixels 
along each perpendicular direction if we take the natu- 



ral pixel size of Qp 



4.5 arcmin that is suggested by 



our array configuration. This results in a data cube with 
16 X 16 X 30 = 7680 voxels^^, which is a tiny fraction 
of the total data output of a typical 21cm tomography 
experiment, but a reasonable amount of data to analyze 
for several reasons: 

1. Cosmological evolution of the signal. As we 

remarked in Section II A[ the power spectrum is 
only a sensible statistic if translational invariance 
is assumed, and this in turn rests on the assump- 
tion that the cosmological evolution of the signal 
is negligible over the redshift range of one's data 
cube. In practice this means that one must mea- 
sure the power spectrum separately for many data 
cubes, each of which has a short radial extent. The 
configuration that we use for the numerical case 



study in Section |IV| should be considered one such 
small cube. 

2. Comparisons between inverse variance sub- 
traction and line-of-sight subtraction. In [40j, 

it was shown that line-of-sight foreground subtrac- 
tion is most effective over narrow frequency ranges. 
As we will discuss in Section VI B[ the inverse vari- 
ance method suffers no such limitation. However, 
we select a narrow frequency range in order to 



Note that while we speak of organizing the data into a cube (see 
the left-most graphic in Figure Q , the physical survey volume 
need not be cubical. In our computations, for instance, we take 
into account the fact that light rays diverge, thus giving a survey 
volume that takes the form of a small piece of a spherical shell. 
The formalism described in Section [H] can be applied to any 
survey geometry, even ones that are not contiguous. 
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demonstrate (see Section IV) that the inverse vari- 



ance method does a better job cleaning foregrounds 
even in that regime. 



3. Computational cost. Even with n 



'pzx 



16^ X 

30 = 7680 voxels, the computational cost is sub- 
stantial. This is due to the fact that one must 
multiply and/or invert many 7680 x 7680 ma- 
trices to use Equations [6j [8| and 10 Because 
of this, the results in Section |IV| required more 
than half a terabyte of disk space for matrix stor- 
age and a little over a CPU-year of computation. 
The problem quickly gets worse with increasing 
^pix") Since a straightforward implementation of the 
computations involved scales as 0{np^^). Fortu- 
nately, for large datasets there exist iterative algo- 
rithms for power spectrum estimation that scale as 
0{npix log Upix) [64, 65 , and with a few approxi- 
mations (such as the flat-sky approximation) these 
fast algorithms can be suitably adapted to perform 
unified foreground subtraction and power spectrum 
estimation for 21cm tomography as we have done 
in this paper [66] . 

We stress, however, that the formalism described in 
Section [ll] is in principle applicable to arbitrarily large 
survey volumes of any geometry. 



D. Cosmological Signal 



For the computations in Section [IVj we ignore the cos- 
mological signal. This is justified for several reasons. 
First, the signal is expected to be at least a factor of 
10^ smaller than the foregrounds [19 , which means that 
including the signal would make little difference to the 
inverse variance foreground cleaning steps, as C would 
be essentially unchanged. The line-of-sight methods, of 
course, remain strictly unchanged as they are blind. De- 
spite this, one might still object that if foreground clean- 
ing is successful, then we should expect the cosmological 
signal to be the dominant contribution to the power spec- 
trum in at least a portion of the k±-k\\ plane. This is most 
certainly true, and the simulation of an actual measure- 
ment of a power spectrum would be incorrect without 
including the cosmological signal as input. However, in 
this paper we do not show what a typical measured power 
spectrum might look like, but simply seek to quantify the 
errors and biases associated with our methods. In this 
context, the main effect of a cosmological signal would 
be to add a cosmic variance contribution to our error 
bars at low /cy and k±, but as we will see in Section IV 
it is precisely at the large scale modes that foregrouW 
subtraction is least successful. The cosmic variance er- 
rors would thus be dominated by the errors induced by 
imperfect foreground subtraction anyway. 

In short, then, it is not necessary to include the cosmo- 
logical signal in our computations, and summing just Cfg 



and N yields a fine approximation to the full covariance 
matrix of the system C for our purposes. 



IV. COMPUTATIONS AND RESULTS 

In this section we compute window functions, covari- 
ances, and biases for both the inverse variance and LOS 
methods described above. Using Equations |6j [Sj and[Toj 
our computations can be performed without the input 
of real data, and depend only on the parameters of the 
system. 

In what follows, we choose the /c-bands of our power 
spectrum parameterization to have equal logarithmic 
widths. In the direction we have 30 bands spanning 
from O.lMpc"^ to lOMpc"^. In the direction we 
again have 30 bands, but this time from 0.01 Mpc~^ to 
lMpc~^. Note that these bands can be chosen indepen- 
dently of the instrumental parameters, in the sense that 
a quadratic estimator can be formed regardless of what 
(/c^, /cll)-values are chosen, although the quality of the 
power spectrum estimate will of course depend on the 
instrument. 



A. Window Functions 

In evaluating the quality of one's power spectrum es- 
timator, one is ultimately interested in the size of the 
error bars. This means that the quantity of interest is 
the covariance matrix V. In this section, however, we 
first examine the window functions. As we saw from the 
discussion in Section pi B[ the window functions are inti- 
mately connected to the covariances, and often provide a 
more detailed understanding for why the error bars be- 
have the way they do. In particular, window functions 
that are wide and badly behaved tend to come hand in 
hand with large error bars. Moreover, our expression for 
the window function matrix (Equation [t]) remains strictly 
accurate even in the presence of non-Gaussian signals (see 
footnote |6|, so in many ways the window function matrix 
W is a more robust diagnostic tool than the covariance 
matrix V. 

In the top panel of Figure |2j we examine a situation 
where C = I, corresponding to a scenario where no fore- 
grounds are present and the signal is dominated by white 
noise. This is of course a highly unrealistic situation, and 
we include it simply to build intuition. Rather than plot- 
ting the raw window functions, we show the normalized 
quantities 



w 



al3 



(37) 



where W is given by Equation [7| and like before, each 
Greek index corresponds to a particular location on the 
k±-k\\ plane. Each set of contours shown in Figure |2] cor- 
responds to a specific row of W. For example, the set 
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of contours at the center of the "cross" in each panel 
of Figure |2] has a fixed index /3 that corresponds to 
{k±,k\\) = (0.092, 0.92) Mpc"^ and is plotted as a func- 
tion of a, or equivalently, as a function of k± and /cy. 
From Equation [7| we see that each set of contours de- 
scribes the weighted average of neighboring points of the 
true power spectrum p that one is really measuring when 
forming the power spectrum estimate p. With C set to 
the identity, the nonzero width of the window functions 
is due solely to the finite volume of our survey, and any 
apparent widening of the window functions towards low 
k± and /cy is due mostly to the logarithmic scale of our 
plots. 

With foregrounds present, the precise shape of the win- 
dow functions will depend on the foreground subtraction 
algorithm employed. Shown in the middle panel of Fig- 
ure [2] are window functions for the LOS foreground sub- 
traction described in Section III CI These window func- 
tions have the same /c^-widths as those in the top panel 
of Figure [2j which is expected since the LOS foreground 
subtraction scheme operates only in the frequency /radial 
direction. On the other hand, foreground subtraction in- 
creases the -widths of the window functions, although 
this effect is significant only in the lower k\\ regions of the 
k±-k\\ plane. At high k\\ , the window functions are very 
similar to those for the case with no foregrounds, since 
(by design) the LOS subtraction of low-order polynomials 
has a negligible effect on high k\\ Fourier modes of the sig- 
nal. As one moves to lower /cy , the foreground subtraction 
has more of an effect, and the window functions widen 
in the radial direction. This results in large "horizontal" 
error bars in the k\\ direction for the final power spec- 
trum estimate. In addition, the widening of the window 
functions represents a loss of information at these low /cy 
values (which is unsurprising, since the very essence of 
the LOS method is to project out certain modes), and in 
Section [IV B| we will see this manifesting itself as larger 
"vertical" error bars on the power spectrum estimate. 

Shown in the bottom panel of Figure [2] are the window 
functions for the inverse variance foreground subtraction 



described in Section II B The same elongation effect is 
seen in the k\\ direction, but the problem is less severe 
in the sense that severe elongation does not occur until 
one reaches much lower values of /cy . This is illustrated in 
Figure [sj where we plot the fractional increase (compared 
to the foreground-free case) in window function width in 
the k\\ direction for both the LOS method (red, solid 
line) and the inverse variance method (blue, dashed line) 
as a function of the /cy coordinate of the central peak of 
the window function. The width is defined to be the full- 
width-99%-max^^ (in logarithmic /c-space) of the window 



37) 



FIG. 2: Normalized window functions W (see Equation 
for an unrealistic situation with no foregrounds (top panel) 
the line-of-sight foreground subtraction described in Section 
lie (middle panel), and the inverse variance foreground sub- 
traction described in Section II B (bottom panel). 



We define the full-width-99%-max as the full width at which 
the function has dropped to 99% of its peak value. A more 
conventional measure of the width like the full-width-half-max 
(FWHM) is not feasible for our purposes, as the window func- 
tions are elongated so much by foreground subtraction that if we 
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FIG. 3: Fractional increase (compared to the foreground- free 
case) in window function width in the k\\ direction, plotted 
as a function of the k\\ location (on the k±-k\\ plane) of the 
window function peak . The blue, dashed line is for inverse 
variance foreground subtraction, whereas the red, solid line 
is for the LOS subtraction. The various data symbols denote 
widths measured at different values of k±, and the fact that 
they all lie close to the curves suggest that the window func- 
tion width in the k\\ direction is largely insensitive to the k± 
location of the window function peak. The width is defined as 
the full-width of the window function in logarithmic /c-space 
at which the function has dropped 1% from its peak value. 
The solid line is to guide the eye, and marks a fractional in- 
crease of unity i.e. where the width is double what it would 
be if there were no foregrounds. 



in either the k\\ or k± direction, and as expected there 
is essentially no elongation at regions of high k\\. Fore- 
ground subtraction begins to have an elongating effect on 
the k\\ widths of the window functions at k\\ < lMpc~^, 
just like we saw from our comparisons of the three pan- 
els of Figure [2j The effect is much more pronounced for 
the LOS subtraction, and as an example we can consider 
the k\\ value at which the fractional increase in width is 
unity {i.e. where foregrounds cause the window functions 
to double in width in the /cy direction). This occurs at 
k\\ = 0.7Mpc~^ for the LOS subtraction, but not un- 



defined the widths in such a way, the widths for the windows at 
the very lowest /c|| would run off the edge of our simulation. Our 
measure of the width allows a quantification of the elongation 
even for the lowest /c|| values. In all regimes where a meaningful 
measurement of the band powers can be made, the window func- 
tions will be compact enough for the width to be defined using 
the FWHM. With the inverse variance method, for instance, one 
sees from the bottom panel of Figure [2] that the FWHM remains 
nicely within the simulation box even though it is larger than for 
the foreground- free case. 
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FIG. 4: A comparison of two normalized window functions 
W centered at (k±,k\\)^ (0.0259, 0.2593) Mpc"^ (marked by 
"x"). The dotted contours correspond to a scenario with no 
foregrounds, while the solid contours correspond to the win- 
dow functions for the inverse variance foreground subtraction 
described in Section FlI Bl Note the linear scale on both axes. 



til = 0.5 Mpc~ for the inverse variance subtraction, 
which means that with the latter scheme, one can push 
to lower values of k\\ before there is significant informa- 
tion loss and the power spectrum band estimates become 
massively correlated. Note that this conclusion holds for 
all k± , as evidenced by the way the various plot symbols 
for different k± all lie very close^^ to the curves in Figure 

El 

Unlike the /cy widths, with the k± widths we find that 
there are important qualitative differences in addition to 
quantitative differences between the two methods. This 
is evident in Figure |4j where we compare inverse vari- 
ance and foreground-free window functions (solid and 
dashed contours respectively) centered at {k±^k\\) = 
(0.0259, 0.2593) Mpc"\ The analogous plot for the LOS 
method is shown in Figure |5j As discussed before, there 
is an elongation in the /cy direction that is more pro- 
nounced in the LOS method than the inverse variance 
method. For the LOS method, this is all that happens, 
because the foreground subtraction is performed using 
only line-of-sight information. In contrast, since the in- 
verse variance method works by de- weighting parts of the 
data that are heavily contaminated by foregrounds, it is 
capable of also leveraging information in the transverse 
direction. This results in the slight widening in the k± 



^ For the LOS subtraction, this invariance is in fact exact since 
the algorithm performs the same polynomial subtraction along 
the line-of-sight regardless of what happens in the transverse 
directions. 
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FIG. 5: A comparison of two normalized window functions 
W centered at {k±,k\\)^ (0.0259, 0.2593) Mpc"^ (marked by 
"x"). The dotted contours correspond to a scenario with no 
foregrounds, while the solid contours correspond to the win- 
dow functions for the LOS foreground subtraction described 
in Section FlI CI Note the linear scale on both axes. 



direction seen in Figure [4j 

In Figure [6j we show the fractional increase in k± width 
(again as compared to the foreground- free windows) as 
a function of k±. The width increases towards lower k±_ 
because the inverse variance foreground subtraction is 
taking advantage of the smooth angular dependence of 
Galactic synchrotron radiation to accomplish some of the 
foreground cleaning. One can also see that the width 
decreases with increasing /cy, which is due again to the 
fact that the spectral smoothness of foregrounds mean 
that they dominate mainly the low k\\ modes, and so the 
inverse variance subtraction need not act so aggressively 
at high k\\. 

In summary, what we see is that since the inverse vari- 
ance method downweights data selectively on the basis of 
expected foreground contamination, its approach to fore- 
ground subtraction is more "surgical" than that of the 
LOS method. For instance, at intermediate k\\ (where 
foregrounds are non-negligible but certainly not at their 
peak), it subtracts foregrounds in a less aggressive man- 
ner than the LOS method does, allowing for final power 
spectrum estimates in those regions to be less correlated 
with each other. This is still the case in low k_\_^ low /cy 
regions, but here the algorithm also uses angular infor- 
mation for its foreground subtraction. Put another way, 
the inverse variance technique automatically "chooses" 
the optimal way to subtract foregrounds. As a result, 
the fact that the elongation of window functions in the 
k\\ direction is so much greater than the widening in the 
k± direction (notice the different scales on the vertical 
axes of Figures [s] and [6| means that we now have quan- 
titative evidence for what has thus far been a qualitative 
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FIG. 6: Fractional increase (compared to the foreground- free 
case) in window function width in the k± direction, plotted 
as a function of the k± location (on the k±-k\\ plane) of the 
window function peak . The blue, dashed line is for inverse 
variance foreground subtraction, whereas the red, solid line is 
for the LOS subtraction. The dotted and dash-dotted curves 
display the same quantity as the dashed line, but at lower k\\. 
The width is defined as the full- width of the window function 
in logarithmic /c-space at which the function has dropped 1% 
from its peak value. 



assumption in the literature — that foreground subtrac- 
tion for 21-cm tomography is most effectively performed 
using line-of-sight information rather than angular infor- 
mation. 



B. Fisher Information and Error Estimates 

In the top panel of Figure [7| we return to the fore- 
groundless scenario where C = I and plot the diagonal 
elements of the Fisher matrix (which should roughly be 
thought of as being inversely proportional to the square 
of the power spectrum error bars — see Section II B ). The 



Fisher information increases from the bottom left to the 
top right, and in the middle of the k_\_-k\\ plane, their con- 
tours have a (logarithmic) slope of —2. To understand 
this, consider the process by which one forms a power 
spectrum in the k±-k\\ plane: the three-dimensional real- 
space data cube is Fourier transformed in all three spatial 
directions, and then binned and averaged over concentric 
annuli in /c-space. The averaging procedure has the ef- 
fect of averaging down noise contributions to the power 
spectrum estimate, and thus as long as the dominant 
source of error is instrumental noise, the error bars on a 
particular part of the power spectrum will decrease with 
increasing annulus volume. With /c-bins of equal logarith- 
mic width (see Section IV), the volume increases twice 
as quickly when moving to regions of higher k± in the 
(logarithmic) k±-k\\ plane than when moving to regions 
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FIG. 7: A plot of the diagonal elements of the Fisher informa- 
tion matrix for an unrealistic situation with no foregrounds 
(top panel) and with foregrounds cleaned by the inverse vari- 
ance method (bottom panel). The (unnormalized) contours 
increase in value from the bottom left corner to the top right 
corner of each plot, and are chosen so that crossing every two 
contours corresponds to an increase by a factor of 10. Dashed 
lines with logarithmic slopes of —2 are included for reference. 



of higher /cy , giving a logarithmic slope of —2 for contours 
of equal Fisher information. Conversely, Fisher informa- 
tion contours of slope —2 imply that one's error bars are 
dominated by instrumental noise. 

At high k\\ or high k_\_, the contours shown in the top 
panel of Figure [7| are seen to deviate from this behavior, 
although because of the range of the plot, the effects 
are somewhat visually subtle. At high k\\ the contours 
get slightly steeper, implying a lower information content 
(larger error bars) than one would expect if limited by 
instrumental noise. This is due to the fact that at very 
high k\\ there is insufficient spectral resolution to resolve 
the extremely fine small scale modes, resulting in power 
spectrum estimates with large error bars there. Finite 
angular resolution has a similar effect in the k± direction. 



With foregrounds, the Fisher information takes a 
markedly different form, as seen in the bottom panel of 
Figure [7| where we have plotted the Fisher information 
after carrying out inverse variance foreground subtrac- 
tion. One sees that only small regions of the k±-k\\ plane 
have contours with a logarithmic slope of —2, suggesting 
that there are few — if any — parts that are dominated 
solely by instrumental noise. 

The effects of foreground subtraction are more easily 
interpreted if one instead considers the following quan- 
tity^^, which has units of temperature squared: 



27r2 



Pr(fc±,fc||). 



This is exactly analogous to the quantity 

fc3 



A^(fc) 



27r2 



P{k) 



(38) 



(39) 



for galaxy surveys, which quantifies the contribution to 
the density field variance from a particular logarithmic 
interval in k, and 



2n 



Ce 



(40) 



for CMB measurements, which quantifies the contribu- 
tion to temperature variance per logarithmic interval in 
i. The quantity A^icm ^^^^ measures the contribution 
to 21 cm brightness temperature variance per logarithmic 
interval in (/cx, A^||)-space. 

Since Equation \i7\ tells us that the error bar on a par- 
ticular band of Pt(^±, ^||) is given by (Faa)~^^'^ for the 
inverse variance method, we can estimate the error on 
A2icm by computing the quantity^^ 



e{k±,k\\) 



27r2 V 



1 



(41) 



which, having units of temperature, is convenient for 
gauging the quality of our foreground cleaning. Note 
that this expression can also be used for the noiseless 
case (since C oc I can be considered a special case of the 



The expressions given by Equations |38| and |40| are often referred 
to as "dimensionless power spectra" . This is of course somewhat 
of a misnomer, because both expressions carry dimensions of 
temperature squared, and are "dimensionless" only in the sense 
that all units of length have been canceled out. 
This expression is in fact an upper limit on the error. That it is 
not exact arises from our taking of the square root of Equation 
|38| to obtain a quantity with temperature units. In the limit of 
small errors, for instance, Taylor expanding Equation |38| with a 
perturbation suggests that we ought to have a factor of 1/2 in 
front of our current form for £(k±, fcy ). It is only when the errors 
dominate the signal (such as at low k\\ — see Figure [Sj that e 
tends to the expression given in Equation |41| For our illustra- 
tive purposes, however. Equation suffices as a conservative 
estimate. 



16 



inverse variance method) as well as for the line-of-sight 
foreground scheme (as long as one uses the pseudoinverse 
and forms the Fisher matrix using Equation 25, as out- 
lined in Section II C). 



No foregrounds 



Alternatively, we can think of the quantity e intro- 



duced in Equation 41 as the real-world degradation fac- 
tor^ because it can also be interpreted as the ratio (up 
to an overall dimensionful normalization factor) of the 
actual error on the power spectrum Pt(^±7^||) to the 
error that would be measured by a noisy but otherwise 
ideal experiment with infinite angular resolution and in- 
finite spectral resolution. In other words, £ tells us how 
much larger the error bars get because of real-world is- 
sues like foregrounds and limited resolution. To see this, 
note that such an ideal experiment would have a constant 
£, because the k\k\\ factor essentially nulls out the geo- 
metric effects seen in the top panel of Figure [7| when one 
is not limited by resolution. 

In the top, middle, and bottom panels of Figure |8j 
we show £ for the noise-only scenario, the inverse vari- 
ance method, and the line-of-sight method, respectively. 
With the foreground-free case in the top panel, we see 
that the error bars increase as one goes from the bottom- 
left corner (low /c^, low k\\) to the top-right corner (high 
/c^, high k\\). The instrument's finite angular resolution 
causes the errors to increase in the direction of increasing 
A:_L, whereas in the direction of increasing k\\ this is due to 
the finite spectral resolution. The smallest error bars on 
the plot are < 10 mK whereas the largest are ~ 20 mK. 

Introducing foregrounds causes the errors to increase 
everywhere on the k^-k\\ plane, even after inverse vari- 
ance cleaning (middle panel of Figure [8|. However, in 
addition to the effects of finite angular resolution and fi- 
nite spectral resolution, we see that there are now also 
significant errors at low k\\ (indeed, this is where the er- 
rors are largest). This is due to residual foreground con- 
tamination — being spectrally smooth, the foregrounds 
have their greatest effect at low /cy , and thus that region 
of the k^-k\\ plane is the most susceptible to inaccura- 
cies in foreground cleaning. The "downhill" gradients of 
the contours there point to higher /cy , suggesting that the 
foregrounds are the dominant source of error. 

Considering all the effects together, one sees that the 
"sweet spot" for measurements of A2icm (or, alterna- 
tively, where a realistic experiment compares the most 
favorably to an ideal experiment for measuring the power 
spectrum Pt) is at low k±_ (to avoid being limited by an- 
gular resolution) and intermediate k\\ (to avoid being lim- 
ited by spectral resolution or residual foreground contam- 
ination). In such a region the typical errors are ^10 mK, 
whereas in the resolution-limited region at the top-right 
corner of the plot the errors are ~ 30 mK and in the 
foregrounds and angular resolution-limited region in the 
bottom-right corner the errors are ~ 130 mK. 

Although the features are seen to be qualitatively sim- 
ilar when we move from the inverse variance method to 
the LOS method (bottom panel of Figure [8|, quantita- 
tively we see that the errors have grown yet again, regard- 
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FIG. 8: Expected power spectrum error bars for a situation 
with no foregrounds (top panel, Equations 13 and 
C oc I, scaled to match the noise-dominated k±-k\\ 
of the other scenarios) 
panel, Equations 



with 
regions 

the inverse variance method (middle 



and 
and 



41 ); the line of sight method (bottom 
41 ). For the last two plots, the errors 
high k\\ , and high k± due to residual 



panel, Equations 
are high at low k\\ 
foregrounds, limited spectral resolution, and limited angular 
resolution, respectively. 
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less of one's location on the k±-k\\ plane. The smallest er- 
rors (found in the low k±, intermediate k\\ "sweet spot") 
remain < 10 mK, and the errors in the resolution-limited 
region (high k±^k\\) remain ~ 30 mK. However, the fore- 
ground contaminated regions take up a larger portion of 
the k±-k\\ plane, and the most contaminated parts of the 
plot have errors of ~ 900 mK. 



Inverse Variance method 



As explained in Sections II B and II C , the higher errors 
are to be expected with the LOS scheme, given that the 
method does not make full use of the available data, hav- 
ing projected out the low-order polynomial modes. Put 
another way, the broad window functions in the middle 
panel of Figure [2] indicate that information has been lost 
and that power has been filtered out of the data. Thus, 
in order for the power spectrum estimate to not be bi- 
ased low (because of the power loss), it is necessary to 
multiply by a large normalization factor. This normaliza- 
tion factor is derived by imposing the W^a = 1 window 
function constraint (see [45 ), and manifests itself as the 
¥aa factor in the denominator of Equation [16] — if much 
power has been filtered out of the mode corresponding 
to the /c-space index a, the remaining Fisher information 
Faa in that mode will be low, and one must divide by 
this small number so that the final estimate is of the right 
amplitude. However, this also has the effect of magnify- 
ing the corresponding error bars, leading to the larger 
errors in the bottom panel of Figure [Sj 

One may be initially puzzled by the trends shown in 
the last two panels of Figure [S) which at first sight appear 
to differ from plots like those shown in Figure 9 of [39 . In 
particular, in [39 the post-subtraction foreground resid- 
uals are the lowest at low k± and low /cy, which seem 
to imply that the cleanest parts of the final power spec- 
trum ought to be there. This, however, is a misleading 
comparison. A plot of residuals like that in ^39^ should 
be viewed as a plot of the expected size of the measured 
power spectrum (containing both the cosmological signal 
and the residual foregrounds), not the size of the error 
bars associated with the measurement, which are shown 
in Figure [Sj Put another way, the power spectrum is not 
truly clean at low k± and low /cy , for even if the measured 
values are small, the error bars are large, rendering that 
part of the power spectrum unusable. 



C. Residual Noise and Foreground Biases 

In Figure |9) we show the residual noise and foreground 
bias terms (Equations [g] and 22, but cast in temperature 
units in a way similar to Equation [41] ) for the inverse 
variance (top panel) and LOS (bottom panel) subtrac- 
tion methods, respectively. The biases that need to be 
subtracted in both methods are seen to be qualitatively 
similar, and the trends shown in Figure [9] can be readily 
understood by thinking of the bias subtraction in Equa- 
tion [2] as a foreground subtraction step in its own right. 
Acting on the k±-k\\ plane directly, this step seeks to 
remove any residual foreground power from the power 
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FIG. 9: Bias term as a function of k± and k\\ for the inverse 
variance method (Equatio n [6[ top panel), and for the line- 
of-sight method (Equation 22 bottom panel). The LOS plot 
has been artificially normalized to match the inverse variance 
plot at values of medium k± and k\\, where we expect the two 
techniques to be extremely similar. 



spectrum itself. It therefore has its biggest effect where 
foreground residuals are expected to be large, and in Sec- 
tion [V| we will develop a simple toy model to gain intu- 
ition for how this works. 



V. AN INTUITIVE TOY MODEL 

In this section, we develop a simple toy model for un- 
derstanding the inverse variance foreground subtraction 
and power spectrum estimation scheme. The goal here is 
not to reproduce the exact results of the previous section, 
but rather, to provide intuitive explanations for how the 
power spectrum estimation algorithms work. 

For our toy model, we consider a one-dimensional sur- 
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vey volume in the line-of-sight direction. This choice is 
motivated by the fact that it is generally the spectral 
rather than spatial dependence of foregrounds that is 
most useful for foreground subtraction, and consequently 
it is the algorithm's behavior as a function of frequency 
that is the most interesting. We again take the frequency 
range of the survey to be small. 

We employ a one-dimenstional covariance of the form 



(42) 



where we have written a continuous covariance instead of 
a discretized covariance matrix because we will be work- 
ing in the continuous limit in this section in order to 
eliminate pixelization effects. The foregrounds are mod- 
eled by R{y^v'\ is a normalization constant to be 
determined later, and the noise is modeled by the delta 
function. Without loss of generality, we can select units 
such that the r.m.s. foreground covariance is equal to 
unity at all frequencies i.e. R{y^v) = 1 for all u. This 
amounts to dividing the line-of-sight covariance r(z/, u') 
of Section |III| by a frequency-dependent normalization 
a{u)a{i'^)^ where 



-Q!0+cr^ ln(zy/zy*) 



givmg 



R{v^ v') = exp 



2 ^ ^ 



(43) 



(44) 



i.e. a Gaussian in logarithmic frequency. Since we are as- 
suming a narrow frequency range, the logarithmic terms 
can be expanded about z^*, giving 



i?(z^, v') ^ Riv — v') = exp 



/^2 



2vl 



(45) 



where we have defined the foreground coherence length^^: 



1 



(46) 



Note that the foreground contribution now depends on 
just the difference between v and i.e. it is translation 
invariant in frequency space. The same is true for the 
delta function noise term, which accurately captures the 



fact that the instrumental noise is uncorrelated between 
different frequencies. For simplicity, we model the noise 
as being white in our chosen units^^. Putting everything 
together, we see that our toy model is one that is defined 
by the covariance 



7 



exp 



2v1 



(47) 



where we have redefined our normalization constant so 
that the noise and foreground pieces are individually nor- 
malized to unity, allowing the constant 7 to be inter- 
preted as a foreground-to-noise ratio. 

Let us now apply the inverse- variance power spectrum 



estimation method of Section JIB to our toy model. In- 
terpreting C{y — v'^ as an integral kernel, an inverse- 
variance weighting of the data corresponds to applying 
the inverse kernel C~^{y — v'\ By the convolution the- 
orem, a translation invariant kernel acts multiplicatively 
when written in its dual Fourier basis^^. This makes the 
computation of the inverse kernel straightforward — one 
simply computes the reciprocal of the Fourier transform 
of C . Defining 77 to be the Fourier dual of the inverse 
kernel can be shown to take the form of a logistic curve 



m ri 



2. 




L 2 2 



(48) 



where = 1420 MHz is the rest frequency of the 21cm 
line, and Hq^ and c have their usual meanings. This 



kernel is plotted in Figures [10] and [IT] In Figure [TO) we 
keep the foreground coherence length Uc fixed at 0.5 MHz 
(chosen to calibrate our toy model against the results 
of Section |IV[ ) and vary the foreground-to-noise ratio 7 
from to 10 , whereas in Figure [ll] we keep 7 fixed at 
10^ (typical of first-generation 21cm tomography exper- 
iments with 1000 hrs of integration time) and vary Uc- In 
both instances the fiducial experimental cases are plotted 
using solid black curves. 

From the plots, ones sees that the action of the fore- 
ground cleaning kernel is to suppress low-Zcy modes 
in the final power spectrum estimate. Foreground sub- 
traction in the inverse variance scheme thus amounts 



While this definition of the foreground coherence length is what 
follows from Taylor expanding R{u^u')^ we caution that Equa- 
tion ^6] should not be taken too literally as a way to compute h>c- 
Doing so gives rather long coherence lengths, over which the lin- 
earized version of R{y^ v') becomes a bad approximation. Indeed, 
it is somewhat unphysical for Uc to depend on i/*, which after 
all was just an arbitrary frequency in the power-law foreground 
parameterizations of Section [Hi] In what follows, we will choose 
a fiducial value of Uc = 0.5 MHz to "anchor" our toy model to 
the results of Section lIVI 



This is of course not strictly correct, but suffices for the illustra- 
tive purposes of this toy model, especially given the narrow fre- 
quency ranges required for power spectrum estimation in 21 cm 
tomography. 

Strictly speaking, this is only true if the kernels are integrated 
from —00 to 00, which is somewhat unphysical since v > and 
moreover must be within the frequency range of one's instrument. 
However, since the foreground kernel decays quickly away from 
1/ = I/' ^ we make this approximation for the purposes of our toy 
model 
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FIG. 10: A plot of the foreground cleaning kernel C ^ (Equa- 



tion 



48| for different values of the foreground-to-noise ratio 

at 7 = 1 for the 



7, set at 7 = for the purple/dotted curve 
blue/dashed curve, at 7 = 100 for the red/dot-dashed curve, 
and at 7 = 10^ for the black/solid curve. In all cases, the 
foreground coherence length i^c — 0.5 MHz. The black/solid 
curve is intended to be representative of a first-generation 
21 cm tomography experiment. 
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FIG. 11: _A plot of the foreground cleaning kernel C 



(Equation 48 ) for different values of the foreground coherence 
length i/c, with Vc = 0.1 MHz for the purple/dotted curve, at 
Vc — 0.3 MHz for the blue/dashed curve, at Vc — 0.5 MHz for 
the black/solid curve, and at Vc = 1.0 MHz for the red/dot- 
dashed curve. In all cases the foreground-to-noise ratio 7 is 
fixed at 10^. The black/solid curve is intended to be repre- 
sentative of a first-generation 21 cm tomography experiment. 



to applying a high-pass filter in the frequency /line-of- 
sight direction. The highest k\\ modes are left untouched, 
whereas the lowest k\\ modes are suppressed by a factor 
of 1 + 7. This can be seen in Figure [lOj where there 
is no suppression when 7 = 0. Physically, this is be- 
cause with 7 = there are no foregrounds and one only 
needs to deal with instrumental noise. Since noise contri- 
butions from different frequencies are uncorrelated, the 
effect of noise is best mitigated through simple averaging 
and binning, which is accomplished by the C^q, piece of 



Equation 16 and not by the inverse variance weighting. 
The kernel is thus flat when 7 = 0. At high values 
of 7, the low-/c|| modes are essentially taken out com- 
pletely because those are precisely the modes that are 
heavily contaminated by foregrounds. In intermediate 
regimes, there is partial suppression of the low-Zcy modes 
to mitigate the foregrounds, but a complete suppression 
is unnecessary because the foregrounds are relatively low 
in amplitude; thus, useful cosmological information may 
still be extracted from these modes. 

To understand the dependence of on the coherence 
length it is useful to define a cutoff scale k^^^ . We 
define this scale to be the scale at which the k\\ modes are 
"half-suppressed" , with a weighting that is the arithmetic 
mean of (1 +7)"^ (the weighting at k\\ = 0) and 1 (the 
weighting at ku = 00). A little bit of algebra reveals that 



Ho 



f2i/V21n(2 + 7). (49) 



For k\\ <C /cy^^, the modes are severely contaminated 
by foregrounds and are largely thrown out, whereas for 
k\\ ^ /cy^^ the foregrounds are minimal and essentially 
no cleaning is required. Equation [49] and Figure [To] show 
that as foregrounds become increasingly important {i.e. 
as 7 grows), /c^^ increases. This is because with higher 
foregrounds, one is forced to clean to higher k\\ modes 
before estimating the power spectrum. Encouragingly, 
however, k^^^ depends only logarithmically on 7, which 
means that as the foregrounds become increasingly im- 
portant, only a few more modes need to be thrown out. 
For instance, as the foreground-to-noise ratio 7 goes from 
1 to 10^, /cy^^ increases by only a factor of ~ 3. In Fig- 
ure jllj and Equation 

' Physically, this is because a high Vc implies 



we see that as Vc increases, /cfj^^ 

decreases. 

foregrounds that are very spectrally coherent, which al- 
lows the foreground cleaning to be confined to a handful 
of large scale (small k\\) modes. 

As noted in Section |TI| subtracting foregrounds using 
the inverse variance method is a two-step process. The 
first step is a linear one that acts on the data itself, and 
we have seen in this section that this first step can be 
roughly understood as a high-pass filter in the frequency 
direction. The second step involves foreground mitiga- 
tion in the power spectrum itself and amounts to the sub- 
traction of the residual noise and foreground bias term 
ho, = tr[C^Q;C~^]. This term can be approximated in our 
toy model by replacing the discrete trace by a continuous 
integral: 

6„ = tr[C,„C-i] = ^(C,„),,C-/ 

« j C^a{j^i,Uj)C~^{iyj,Ui)duiduj 

where we used the fact that in one dimension. 
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With the final integral being 



precisely the Fourier transform of C ^, we see that the 
bias term that needs to be subtracted off is proportional 



to the C ^ function given by Equation 48 Intuitively, 



this can be thought of as a subtraction of the statisti- 
cally expected foreground residuals in -space. From 
the form of Equation 



48 



we can see that at low k\\ — 
where the linear foreground subtraction step acts more 
aggressively — the residuals are expected to be small, 
and the extra foregrounds that need to be removed in 
-space are small. 

Using similar manipulations, one can see that the bias 
term acts in a similar way for the LOS polynomial sub- 
traction. From Equation 22 we have = tr[C^Q,DCD] 
for the LOS subtraction. 



where recall that D was the 
projection matrix responsible for projecting out the low- 
est order polynomials in the frequency direction (which 
hopefully contain most of the foregrounds). Comparing 
this to Equation 50 tells us that one simply needs to sub- 
stitute for DCD. This combination can be rewrit- 
ten as 



DCD = (D(x - m)(x - m)^D^), 



(51) 



which is the covariance of the residual foregrounds af- 
ter LOS subtraction. Conceptually, then, the residual 
bias term works in the same way as it did in the inverse 
variance subtraction — one is simply subtracting off the 
statistically expected foreground residuals in power spec- 
trum space. 

Finally, we can make a similar estimate of the Fisher 
information for the inverse variance method. Starting 



from Equation 13 , we have 



= -tr [C,«C-iC ;3C-^] 



oc 



1 + 7 exp 



(52) 



which looks exactly like the curves plotted in Figure [Toj 
but squared. This shows that the Fisher information 
content is much higher at high values of k\\ than at low 
values of . In other words, after foreground subtraction 
removes power from large scales (small /cy) there is little 
information left there and the error bars are large. This is 
in accordance with the middle panel of Figure [S) where 
the expected errors are seen to decrease as one moves 
away from the lowest /cy values. Note, however, that the 
finite spectral resolution effects that we saw in the high 



k\\ regions of Figure Isl are not captured by Equation [52j 
This is to be expected, since here we have a toy mooel 
whose covariance is a continuous function of frequency. 

The results of this section suggest that the inverse vari- 
ance foreground subtraction can be qualitatively thought 
of as a scheme where the data are high-pass filtered in the 
frequency direction. This can be done by applying a mul- 
tiplicative filter in space, an example of which is given 
by Equation |48j In fact, such line-of-sight filtering can 
be considered a computationally cheap approximation to 
the full inverse variance method, requiring just a simple 
multiplication instead of the inversion of 
covariance matrix. The results should be almost as good, 
since we saw from the bottom panel of Figure [2] that even 
in the full method, the algorithm "chooses" to perform 
foreground subtraction almost exclusively in the line-of- 
sight direction. 

Filtering in the line-of-sight direction is an approach 
that is extremely similar to one that was proposed in 
[46]. There, the authors suggest cleaning foregrounds by 
excluding all modes with /cy < /cy^**, where /cy^*^ is some 
critical line-of-sight mode below which foregrounds are 
expected to dominate. This corresponds to applying a 
step- function filter, and in [46 it was found that this 
dramatically reduces systematic biases in the spherically 
averaged 3D power spectrum. However, the authors also 
caution that some foregrounds may have some small (but 
nonzero) high /cy component that will not be alleviated 
by a sharp filter. This problem is solved by instead using 
the filter C~^{k\\) given in Equation 48 From Figures 
10 and 11 one sees that the filter is qualitatively similar 
to a step function, but the fact that the filter is itself 
dictated by a foreground model allows it to enact slight 
suppressions of foregrounds even at high k\\ modes. 



VI. DISCUSSION AND CONCLUSIONS 

In this paper, we have presented a unified matrix-based 
formalism for 21 cm foreground removal and power spec- 
trum estimation. Using this framework, we compared ex- 
isting line-of-sight (LOS) polynomial foreground subtrac- 
tion schemes with a proposed inverse variance method, 
quantifying their errors and showing how to eliminate 
their biases. We now review our basic results, and fol- 
low with a discussion of their general applicability before 
summarizing our conclusions. 



A. Basic Results 

Through the numerical case study performed in Sec- 
tion [TVl we have established a number of qualitative re- 
sults: 

1. LOS polynomial foreground subtraction is non- 
optimal, and gives rise to larger error bars in fi- 
nal power spectrum than inverse- variance subtrac- 
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tion does. This is mostly due to the fact that 
the LOS method projects out low-order polyno- 
mial modes, destroying information. The inverse 
variance method, on the other hand, preserves all 
modes, even if it does downweights some of them 
substantially. As a result, the error bars on the fi- 
nal power spectrum are larger for the LOS method 
(bottom panel of Figure [8| than for the inverse vari- 
ance method (middle panel of Figure [8|. 

2. Traditional LOS polynomial subtraction methods 
contain residual noise and foreground biases in es- 
timates of the power spectrum. Fortunately, this 
bias can be easily quantified and removed using 
Equation [6j 

3. In the low /cy regions of the k±-k\\ plane, the LOS 
polynomial foreground cleaning gives rise to power 
spectrum estimates that are highly correlated be- 
tween neighboring k\\ bins, or equivalent ly, to win- 
dow functions that are elongated in the /cy direc- 
tion. The LOS polynomial subtraction has no effect 
on the k± direction and exhibits the same qualita- 
tive behavior regardless of the value of k± since it 
does not use angular foreground information in its 
cleaning. 

4. The behavior of the inverse variance window func- 
tions, on the other hand, depends strongly on the 
value of k±: 

(a) In regions where neither k± nor k\\ is small, the 
post-inverse variance subtraction power spec- 
trum estimates are no more correlated than 
what would be expected from the finiteness of 
the survey geometry alone. In other words, 
the inverse variance window functions in this 
region are unchanged by the presence of fore- 
grounds. This is due to the fact that fore- 
ground contamination was mild to begin with, 
resulting in a less aggressive foreground sub- 
traction by the inverse variance weighting in 
order to keep power spectrum estimates as un- 
correlated as possible. 

(b) In regions where both k_\_ and /cy are small, 
the inverse variance window functions widen 
in both directions, but the effect in the kj_ 
direction is minimal. This indicates that in- 
verse variance subtraction operates mainly in 
the radial direction, even though this is not an 
a priori mathematical constraint. Put another 
way, the nature of the foregrounds and the 
availability of high spectral resolution instru- 
ments makes it much more fruitful to lever- 
age frequency information (rather than an- 
gular information) in performing foreground 
subtraction. This confirms intuitive expecta- 
tions in the existing literature. 



(c) The inverse variance window functions show 
less k\\ widening than the LOS polynomial 
window functions do, regardless of the loca- 
tion on the k±-k\\ plane. This suggests that 
with the inverse variance scheme, one should 
be able to push to lower values of k\\ be- 
fore correlations between neighboring /cy-bins 
make the resulting power spectrum estimates 
scientifically unusable. 

In summary, the inverse-variance method that we have 
presented has several advantages over the LOS poly- 
nomial subtraction methods proposed in the literature: 
they produce power spectrum measurements with smaller 
and less correlated error bars as well as narrower window 
functions. 



B. Applicability of Results to general arrays 



While the results presented in Section IV 



were com- 



puted for a set of specific MWA-like array parameters 
and a specific foreground model, we emphasize that the 
formalism presented in Section [ll] is completely general 
and can be applied to any 21 cm tomography experiment 
and any foreground model. In addition, as we will now 
argue, our qualitative conclusions from the last subsec- 
tion should in fact hold quite generally. 

Consider first an experiment's location, antenna lay- 
out, integration time, rotation synthesis, and thermal 
noise level. These factors may initially seem to form a 
high-dimensional parameter space that needs to be thor- 
oughly explored, but from Section [ll] we know that the 
only place where they enter our mathematical framework 
is Equation |36j which determined our noise covariance 
matrix N. Their effects are thus degenerate with each 
other, and we saw in Section |V| that the resulting differ- 
ences in the power spectrum estimation can be captured 
by a single parameter: the foreground-to-noise ratio 7. 

Changing the frequency range of one's data cube has 
several effects, one of which is to alter the instrumen- 
tal noise of the array, depending on the chosen fre- 
quency binning of the data. This, too, is captured by 
the foreground-to-noise parameter. More subtly, an in- 
creased frequency range changes the polynomial fit in the 
LOS scheme and affects foreground subtraction. While 
this can result in large changes to the final power spec- 
trum, it was shown in [40 that for LOS subtraction to 
be effective in the first place, one must perform the fits 
only over narrow frequency ranges anyway. Moreover, as 
we have discussed above, cosmological evolution also im- 
poses a constraint on the frequency range of any one data 
cube (one may of course separately analyze as many data 
cubes as necessary to cover the full data set). Thus, the 
small frequency range used in Section [IV| is a good range 
to use, regardless of the full bandwidth of one's instru- 
ment. For the inverse variance method, our conclusions 
are even more general than for the LOS method. As we 
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saw in the previous section, inverse variance subtraction 
can be thought of as a multiphcative filter in k\\ space, 
and increasing the frequency range simply increases the 
resolution in /cy -space without affecting the overall shape 
of the filter. 

Another tunable parameter in our analysis is the flux 
cut Scut above which the bright point sources can be con- 
sidered identifiable and removable in an earlier stage of 
data analysis. In principle, varying Scut can affect the re- 
sults in a complicated way, but from [40 , we know that 
the amplitude of foregrounds essentially scales with Scut^ 
since the foregrounds are dominated by the brightest 
sources in a population. Therefore, Scut is simply yet an- 
other parameter that can be folded into the foreground- 
to-noise ratio 7. 

C. Future challenges 

In this paper, we have shown that inverse variance 
foreground subtraction is able to clean foregrounds more 
effectively than traditional LOS methods can, resulting 
in power spectrum estimates with smaller error bars. 
Perhaps the only disadvantage of the inverse variance 
method is its computational cost, since it requires the 
inversion of an ripix x Upix matrix, where Upix ~ n^ny is 
the total number of pixels in one's data cube. An LOS 
polynomial algorithm, on the other hand, only requires 
the inversion of an ny x ny matrix. Since matrix inversion 
naively scales as O(n^), the difference is substantial. 

Fortunately, by making a number of reasonable ap- 
proximations one can speed up the method considerably. 
As long as the fields of view are relatively narrow, a se- 
ries of basis changes and matrix prewhitening techniques 
allow the relevant computations to be performed using 
algorithms that scale as 0{npix log Upix)^ and such algo- 
rithms are now being tested numerically for 21 cm tomog- 



raphy [66|. Moreover, since the inverse variance method 
operates mostly in the radial direction, one can define 
an "effective" LOS algorithm that closely mimics the be- 
havior of the full inverse variance subtraction. This was 
demonstrated in Section [V| where we developed an ex- 
ample of such an algorithm and saw that the inverse vari- 
ance scheme can be thought of as a high-pass filter in the 
frequency direction, with the detailed properties of the 
filter dependent on noise and foreground characteristics. 

To further improve the quality of foreground subtrac- 
tion, it will be necessary to construct better foreground 
models. This is because inverse variance foreground 
cleaning is not blind, but instead depends on our em- 
pirical knowledge of foreground properties. It is impor- 
tant to note, however, that this is an advantage and 
not a disadvantage of the inverse variance method — 
whereas the traditional LOS methods (being blind) will 
not improve with better knowledge of the foregrounds, 
the inverse variance method is virtually guaranteed to 
get better as 21 cm tomography experiments begin to 
take science-quality data that are able to put tight con- 
straints on our foreground models. Combining the in- 
verse variance method with the improved measurements 
to be expected from the low frequency radio astronomy 
community should therefore bring us closer to fulfilling 
the potential of 21 cm tomography to improve our under- 
standing of the Epoch of Reionization, the Dark Ages, 
and fundamental physics. 
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